Exercises and associated data

The data and modelling objects created in this notebook can be downloaded directly to save computational time.


Users who wish to complete the exercises can download a small template R script. Assuming you have already downloaded the data objects above, this script will load all data objects so that the steps used to create them are not necessary to tackle the exercises.

Load libraries and time series data

This tutorial relates to content covered in Lecture 1 and Lecture 2, and relies on the following packages for manipulating data, shaping time series, fitting dynamic regression models and plotting:

library(dplyr)
library(mvgam) 
library(gratia)
library(ggplot2)
library(marginaleffects)
library(zoo)
library(viridis)

We will work with time series of rodent captures from the Portal Project, a long-term monitoring study based near the town of Portal, Arizona. Researchers have been operating a standardized set of baited traps within 24 experimental plots at this site since the 1970’s. Sampling follows the lunar monthly cycle, with observations occurring on average about 28 days apart. However, missing observations do occur due to difficulties accessing the site (weather events, COVID disruptions etc..). You can read about the full sampling protocol in this preprint by Ernest et al on the Biorxiv.

Portal Project sampling scheme in the desert near Portal, Arizona, USA; photo by SKM Ernest
Portal Project sampling scheme in the desert near Portal, Arizona, USA; photo by SKM Ernest


All data from the Portal Project are made openly available in near real-time so that they can provide the maximum benefit to scientific research and outreach (a set of open-source software tools make data readily accessible). These data are extremely rich, containing monthly counts of rodent captures for >20 species. But rather than accessing the raw data, we will use some data that I have already processed and put into a simple, usable form

data("portal_data")

As the data come pre-loaded with the mvgam package, you can read a little about it in the help page using ?portal_data. Before working with data, it is important to inspect how the data are structured. There are various ways to inspect data in R; I typically find the head function to be convenient

head(portal_data)
##   moon DM DO PP OT year month mintemp precipitation     ndvi
## 1  329 10  6  0  2 2004     1  -9.710          37.8 1.465889
## 2  330 14  8  1  0 2004     2  -5.924           8.7 1.558507
## 3  331  9  1  2  1 2004     3  -0.220          43.5 1.337817
## 4  332 NA NA NA NA 2004     4   1.931          23.9 1.658913
## 5  333 15  8 10  1 2004     5   6.568           0.9 1.853656
## 6  334 NA NA NA NA 2004     6  11.590           1.4 1.761330

But the glimpse function in dplyr is also useful for understanding how variables are structured

dplyr::glimpse(portal_data)
## Rows: 199
## Columns: 10
## $ moon          <int> 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 3…
## $ DM            <int> 10, 14, 9, NA, 15, NA, NA, 9, 5, 8, NA, 14, 7, NA, NA, 9…
## $ DO            <int> 6, 8, 1, NA, 8, NA, NA, 3, 3, 4, NA, 3, 8, NA, NA, 3, NA…
## $ PP            <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 1…
## $ OT            <int> 2, 0, 1, NA, 1, NA, NA, 1, 0, 0, NA, 2, 1, NA, NA, 1, NA…
## $ year          <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ month         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,…
## $ mintemp       <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16…
## $ precipitation <dbl> 37.8, 8.7, 43.5, 23.9, 0.9, 1.4, 20.3, 91.0, 60.5, 25.2,…
## $ ndvi          <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1…

We will focus analyses on the time series of captures for one specific rodent species, the Desert Pocket Mouse Chaetodipus penicillatus. This species is interesting in that it goes into a kind of “hibernation” during the colder months, leading to very low captures during the winter period

Our study species, the Desert Pocket Mouse (Chaetodipus penicillatus); photo by SKM Ernest
Our study species, the Desert Pocket Mouse (Chaetodipus penicillatus); photo by SKM Ernest

Manipulating data for modelling

Manipulating the data into a ‘long’ format is necessary for modelling in mvgam, mgcv and brms. By ‘long’ format, we mean that each series x time observation needs to have its own entry in the dataframe or list object that we wish to use as data for modelling. A simple example can be viewed by simulating data using the sim_mvgam function. See ?sim_mvgam for more details

data <- sim_mvgam(n_series = 4, T = 24)
head(data$data_train, 12)
##    y season year   series time
## 1  8      1    1 series_1    1
## 2  5      1    1 series_2    1
## 3  2      1    1 series_3    1
## 4  3      1    1 series_4    1
## 5  2      2    1 series_1    2
## 6  5      2    1 series_2    2
## 7  0      2    1 series_3    2
## 8  1      2    1 series_4    2
## 9  3      3    1 series_1    3
## 10 5      3    1 series_2    3
## 11 1      3    1 series_3    3
## 12 1      3    1 series_4    3

Notice how we have four different time series in these simulated data, but we do not spread these out into different columns. Rather, there is only a single column for the outcome variable, labelled y in these simulated data. We also must supply a variable labelled time to ensure the modelling software knows how to arrange the time series when building models. This setup still allows us to formulate multivariate time series models, as we will see in later tutorials. Below are the steps needed to shape our portal_data object into the correct form. First, we create a time variable, select the column representing counts of our target species (PP), and select appropriate variables that we can use as predictors

portal_data %>%
  
  # mvgam requires a 'time' variable be present in the data to index
  # the temporal observations. This is especially important when tracking 
  # multiple time series. In the Portal data, the 'moon' variable indexes the
  # lunar monthly timestep of the trapping sessions
  dplyr::mutate(time = moon - (min(moon)) + 1) %>%
  
  # We can also provide a more informative name for the outcome variable, which 
  # is counts of the 'PP' species (Chaetodipus penicillatus) across all control
  # plots
  dplyr::mutate(count = PP) %>%
  
  # The other requirement for mvgam is a 'series' variable, which needs to be a
  # factor variable to index which time series each row in the data belongs to.
  # Again, this is more useful when you have multiple time series in the data
  dplyr::mutate(series = as.factor('PP')) %>%
  
  # Select the variables of interest to keep in the model_data
  dplyr::select(series, year, time, count, mintemp, ndvi) -> model_data

The data now contain six variables:
series, a factor indexing which time series each observation belongs to
year, the year of sampling
time, the indicator of which time step each observation belongs to
count, the response variable representing the number of captures of the species PP in each sampling observation
mintemp, the monthly average minimum temperature at each time step
ndvi, the monthly average Normalized Difference Vegetation Index at each time step

Now check the data structure again

head(model_data)
##   series year time count mintemp     ndvi
## 1     PP 2004    1     0  -9.710 1.465889
## 2     PP 2004    2     1  -5.924 1.558507
## 3     PP 2004    3     2  -0.220 1.337817
## 4     PP 2004    4    NA   1.931 1.658913
## 5     PP 2004    5    10   6.568 1.853656
## 6     PP 2004    6    NA  11.590 1.761330
dplyr::glimpse(model_data)
## Rows: 199
## Columns: 6
## $ series  <fct> PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP…
## $ year    <int> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ time    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ count   <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 13, NA,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16.520, …
## $ ndvi    <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1.76132…

You can also summarize multiple variables, which is helpful to search for data ranges and identify missing values

summary(model_data)
##  series        year           time           count          mintemp       
##  PP:199   Min.   :2004   Min.   :  1.0   Min.   : 0.00   Min.   :-24.000  
##           1st Qu.:2008   1st Qu.: 50.5   1st Qu.: 2.50   1st Qu.: -3.884  
##           Median :2012   Median :100.0   Median :12.00   Median :  2.130  
##           Mean   :2012   Mean   :100.0   Mean   :15.14   Mean   :  3.504  
##           3rd Qu.:2016   3rd Qu.:149.5   3rd Qu.:24.00   3rd Qu.: 12.310  
##           Max.   :2020   Max.   :199.0   Max.   :65.00   Max.   : 18.140  
##                                          NA's   :36                       
##       ndvi       
##  Min.   :0.2817  
##  1st Qu.:1.0741  
##  Median :1.3501  
##  Mean   :1.4709  
##  3rd Qu.:1.8178  
##  Max.   :3.9126  
## 

We have some NAs in our response variable count. Let’s visualize the data as a heatmap to get a sense of where these are distributed (NAs are shown as red bars in the below plot)

image(is.na(t(model_data %>%
                dplyr::arrange(dplyr::desc(time)))), axes = F,
      col = c('grey80', 'darkred'))
axis(3, at = seq(0,1, len = NCOL(model_data)), labels = colnames(model_data))

These observations will generally be thrown out by most modelling packages in R. But as you will see when we work through the tutorials, mvgam keeps these in the data so that predictions can be automatically returned for the full dataset.

Exercises

  1. Plot the distributions of the count, mintemp and ndvi variables as histograms to get a better sense of the types of values they can take. See ?hist for guidance
  2. Calculate what proportion of observations in count are zeros. See ?which and ?== for guidance
Check here if you’re having trouble calculating the proportion of zeros in count
# the which function is useful for finding indices in a variable that fit a certain pattern
which(model_data$count == 0)

# to see how many zeros there are, calculate the length of this set of indices
length(which(model_data$count == 0))

# to calculate proportion of observations that are zero, divide this by the total number of observations
length(which(model_data$count == 0)) / length(model_data$count) 

Time series visualizations

The first thing to do in any data analysis task is to plot the data. Graphs enable many features of the data to be visualised, including patterns, unusual observations, changes over time, and relationships between variables. The features that are seen in plots of the data must then be incorporated, as much as possible, into the forecasting methods to be used” (Hyndman and Athanasopoulos; Forecasting: Principles and Practice (3rd ed))

A time series is a set of observations taken sequentially in time. When working with time series, the first step (after inspecting the data structure) is usually to plot the observations over time, as either a line plot or scatterplot

plot(x = model_data$time, y = model_data$count, type = 'l',
     xlab = 'Time', ylab = 'Counts', bty = 'l', col = 'darkred',
     lwd = 2)

This plot reveals several important features including: missing data, evidence of repeated seasonal cycles, lower bounding at zero and long-term undulations. Most of these features make it very challenging to model series like this using conventional time series / forecasting software packages. But before we discuss modelling approaches, let’s build some more visuals to look for other key features of the data

Lag and ACF plots

Lag plots are often used to inspect temporal autocorrelations at a variety of lags in the observations for a given univariate time series. While correlation measures the extent of a linear relationship between two variables, autocorrelations are used to measure linear relationships between lagged values of a time series. Ordinarily we’d want to plot our outcome of interest (count), but the default lag.plot can’t handle missing values in the time series (more on that below). But we can inspect correlations among observed NDVI values (using the ndvi variable) and lagged versions of ndvi to see evidence of strong positive autocorrelation at several lags

lag.plot(model_data$ndvi, lags = 4, pch = 16)

This function works by taking differences of the variable ndvi and plotting the lagged value (i.e. the value of the series at time \(t-lag\)) vs the observed value at time \(t\). We can replicate this process for a single lag so it is more obvious what is happening under the hood

ndvi_lag1 <- diff(model_data$ndvi, lag = 1)
plot(ndvi_lag1, model_data$ndvi[2:NROW(model_data)],
     xlab = expression("NDVI"["t-1"]), 
     ylab = expression("NDVI"["t"]), pch = 16, bty = 'l')

A Pearson’s correlation test confirms that there is a strong positive relationship between current ndvi and what ndvi was one month earlier

cor.test(ndvi_lag1, model_data$ndvi[2:NROW(model_data)])
## 
##  Pearson's product-moment correlation
## 
## data:  ndvi_lag1 and model_data$ndvi[2:NROW(model_data)]
## t = 6.736, df = 196, p-value = 1.765e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3130538 0.5403426
## sample estimates:
##       cor 
## 0.4335688

But unfortunately the lag plot will fail when the variable of interest has missing values, which our outcome of interest (count) has plenty of. Rather than using a lag plot, we can instead plot the autocorrelation function at a variety of lags. These plots are often used for deciding which moving average order to choose when building autoregressive models

acf(model_data$count, 
    na.action = na.pass, 
    bty = 'l',
    lwd = 2,
    ci.col = 'darkred',
    main = 'Counts of PP trappings')

Several features are evident from this plot: First, we observe positive autocorrelation at small lags. When time series show a trend, autocorrelations for small lags will often to be large and positive because observations nearby in time have similar values. So the ACF suggests there is evidence of a trend as autocorrelations tend to slowly decrease as the lags increase. Second, we see evidence of seasonality in captures. When data are seasonal, the autocorrelations will often be larger for the seasonal lags (at multiples of the seasonal period) than for other lags. We know this species shows evidence of torpor in the winter months, so it is not surprising to see a strong annual pattern in these autocorrelations.

pACF plots

The partial autocorrelation function shows something similar, but has controlled for the short-term lagged correlations. These plots are often used for selecting orders of autoregressive models. For example, if we choose an AR2 model then we hope our AR2 parameter will capture remaining autocorrelations at lag 2 when controlling for any autocorrelations at lag 1.

pacf(model_data$count, 
     na.action = na.pass, 
     bty = 'l',
     lwd = 2,
     ci.col = 'darkred',
     main = 'Counts of PP trappings')

Both of these plots suggest there is temporal autocorrelation present, as well as strong seasonality. We can view the ACF and some other useful features of the data with one of mvgam’s primary functions, plot_mvgam_series (see ?plot_mvgam_series for details):

plot_mvgam_series(data = model_data, y = 'count')

STL decompositions

Another standard vizualisation is to compute and plot a Seasonal Trend Decomposition. As we have seen from our example plots so far, time series data can exhibit a variety of patterns including trend and seasonality. It is often helpful to split a time series into several components, each representing an underlying pattern, to inspect features of the data that may not be immediately apparent when simply viewing the series. There are many variations of these decompositions, but the most popular is probably the STL decomposition, which decomposes a time series into seasonal, trend and irregular components using loess smoothing (see ?stl for details). But just to get to a point where we can use this with most ecological series, we have to somehow deal with missing observations. The usual practice is to impute these in some way. We can use the na.approx function from the zoo package for this, which will use linear interpolation to impute missing values. This requires that we first convert the count observations to a time series object of class ts. We set the frequency to 12 as this is close enough to what we have with our lunar monthly observation intervals. We will not be using the imputed data for any modelling, so please don’t worry too much if you aren’t up to speed on how linear interpolation works. This is purely for visualization purposes

# First convert the counts to a time series object of class ts
count_ts <- ts(model_data$count, frequency = 12)
# Impute NAs using linear interpolation
interp_series <- zoo::na.approx(count_ts)

Now we can perform the STL decomposition. This requires us to specify the seasonal window, which controls the span in lags over which the seasonal loess smooth will operate. Smaller values allow for more rapid changes by setting a smaller number of consecutive years to estimate each value in the seasonal component. A key advantage of this method is that the seasonal component is allowed to change over time. But a drawback is that this parameter must be chosen by the analyst, and changing its value can give different results. We will use a 5-year seasonal window in this first example

plot(stl(interp_series, s.window = 5))

The strong seasonality is obvious in this plot, as is the evidence of an undulating trend. There is also a lot of variation in the remainder component, which demonstrates that the seasonality and trend alone cannot capture all of the variation in the observations over time. The sizes of the bars on the right-hand side of these plots show scale to help us understand the relative magnitudes of each component. But what happens if we change the seasonal window and force it to be stable through time?

plot(stl(interp_series, s.window = 'periodic'))

Here the remainder component is even more important, and the trend has become less smooth. Try varying the s.window argument to see how the conclusions about component contributions will change. For more information on decompositions, see this chapter in Applied Time Series Analysis for Fisheries and Environmental Sciences, by Holmes, Scheuerell and Ward

Exercises

  1. Calculate cross-correlations between our outcome variable count and ndvi at lags of 1 - 15 months. This procedure is often used to help identify lags of the x-variable that might be useful predictors of the outcome. Note, you will need to use na.action = na.pass so that the missing values in count don’t cause problems. See ?ccf for details about calculating cross-correlations using type = 'correlation'. Which lags of ndvi would you consider to be the most likely candidates for predicting our outcome count?
  2. Plot an STL decomposition for the ndvi variable
  3. How might you decompose a time series using a GAM? Take a few notes to explain what steps you would take (hint, look for inspiration from this nice blog post by Gavin Simpson on additive time series models with GAMs)
Check here for template code if you’re having trouble with ccf
# fill in the "?" with the correct variable / value
ccf(x = ?,
    y = ?,
    na.action = ?, 
    lag.max = ?,
    # compute correlations at each lag
    type = 'correlation',
    # not necessary but makes for a prettier plot
    bty = 'l',
    lwd = 2,
    ci.col = 'darkred',
    main = expression(paste(italic(Cor),"(",ndvi[lag],",",count, ")")))

GLMs with temporal random effects

Now onto some modelling! Our first task will be to fit a Generalized Linear Model (GLM) that can adequately capture the features of our count observations (integer data, lower bound at zero, missing values) while also attempting to model temporal variation. We are almost ready to fit our first model, which will be a GLM with Poisson observations, a log link function and random (hierarchical) intercepts for year. This will allow us to capture our prior belief that, although each year is unique, having been sampled from the same population of effects, all years are connected and thus might contain valuable information about one another. This will be done by capitalizing on the partial pooling properties of hierarchical models:

General structure of hierarchical models; credit to Bayes Rules!, by Johnson, Ott, and Dogucu (2021); reproduced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
General structure of hierarchical models; credit to Bayes Rules!, by Johnson, Ott, and Dogucu (2021); reproduced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


Hierarchical (also known as random) effects offer many advantages when modelling data with grouping structures (i.e. multiple species, locations, years etc…). The ability to incorporate these in time series models is a huge advantage over traditional models such as ARIMA or Exponential Smoothing. But before we fit the model, we will need to convert year to a factor so that we can use a random effect basis in mvgam. See ?smooth.terms and ?smooth.construct.re.smooth.spec for details about the re basis construction that is used by both mvgam and mgcv

model_data %>%
  
  # Convert 'year' to a factor variable
  dplyr::mutate(year = factor(year)) -> model_data

Preview the dataset to ensure year is now a factor with a unique factor level for each year in the data

dplyr::glimpse(model_data)
## Rows: 199
## Columns: 6
## $ series  <fct> PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP, PP…
## $ year    <fct> 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 20…
## $ time    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
## $ count   <int> 0, 1, 2, NA, 10, NA, NA, 16, 18, 12, NA, 3, 2, NA, NA, 13, NA,…
## $ mintemp <dbl> -9.710, -5.924, -0.220, 1.931, 6.568, 11.590, 14.370, 16.520, …
## $ ndvi    <dbl> 1.4658889, 1.5585069, 1.3378172, 1.6589129, 1.8536561, 1.76132…
levels(model_data$year)
##  [1] "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
## [11] "2014" "2015" "2016" "2017" "2018" "2019" "2020"

We are now ready for our first mvgam model. The syntax will be familiar to users who have previously built models with mgcv. But for a refresher, see ?formula.gam and the examples in ?gam. Random effects can be specified using the s wrapper with the re basis. Note that we can also suppress the primary intercept using the usual R formula syntax - 1. mvgam has a number of possible observation families that can be used, see ?mvgam_families for more information. We will use Stan as the fitting engine, which deploys Hamiltonian Monte Carlo (HMC) for full Bayesian inference. By default, 4 HMC chains will be run using a warmup of 500 iterations and collecting 250 posterior samples from each chain. Ordinarily we would take more posterior samples per chain but 250 give us an object that will be easier to download for completing the exercises. Interested users should consult the Stan user’s guide for more information about the software and the enormous variety of models that can be tackled with HMC.

model1 <- mvgam(count ~ s(year, bs = 're') - 1,
                family = poisson(),
                data = model_data,
                samples = 250)

The model can be described mathematically for each timepoint \(t\) as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = \beta_{year[year_t]} * \boldsymbol{year}_{t} \\ \beta_{year} & \sim \text{Normal}(\mu_{year}, \sigma_{year}) \end{align*}\]

Where the \(\beta_{year}\) effects are drawn from a population distribution that is parameterized by a common mean \((\mu_{year})\) and variance \((\sigma_{year})\). Once the model has finished, the first step is to inspect the summary to ensure no major diagnostic warnings have been produced and to quickly summarise posterior distributions for key parameters

summary(model1)
## GAM formula:
## count ~ s(year, bs = "re") - 1
## 
## Family:
## poisson
## 
## Link function:
## log
## 
## Trend model:
## None
## 
## N series:
## 1 
## 
## N timepoints:
## 199 
## 
## Status:
## Fitted using Stan 
## 
## GAM coefficient (beta) estimates:
##                 2.5%      50%    97.5% Rhat n.eff
## s(year).1   1.822798 2.062830 2.282851 1.00  1315
## s(year).2   2.505502 2.683205 2.847867 1.00  1412
## s(year).3   2.959791 3.092155 3.208346 1.00  1281
## s(year).4   3.140479 3.253370 3.371434 1.00  1146
## s(year).5   1.954301 2.125135 2.309873 1.00  1758
## s(year).6   1.507962 1.755725 1.980168 1.00  1421
## s(year).7   1.791670 2.037620 2.265173 1.00  1003
## s(year).8   2.815735 2.966205 3.098107 1.00  1440
## s(year).9   3.133083 3.249425 3.355478 1.00  1249
## s(year).10  2.559152 2.759090 2.925266 1.00  1580
## s(year).11  2.945402 3.078145 3.198628 1.00  1337
## s(year).12  3.100725 3.210115 3.326138 1.00  1358
## s(year).13  2.046694 2.254260 2.440682 1.00  1022
## s(year).14  2.460342 2.641465 2.800332 1.00  1440
## s(year).15  1.932897 2.170270 2.378564 1.00  1475
## s(year).16  1.848599 2.077805 2.265215 1.00  1450
## s(year).17 -0.236015 1.050810 1.910172 1.01   290
## 
## GAM group-level estimates:
##                 2.5%       50%    97.5% Rhat n.eff
## mean(year) 2.0838085 2.4215150 2.748280 1.02   135
## sd(year)   0.4596017 0.6668245 1.002927 1.05    95
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 1000 iterations ended with a divergence (0%)
## 0 of 1000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior

The diagnostic messages at the bottom of the summary show that the HMC sampler did not encounter any problems or difficult posterior spaces. This is a good sign. Posterior distributions for the coefficients related to the GAM linear predictor (i.e. the \(\beta\)’s) can be extracted as a data.frame or matrix as follows:

beta_post <- as.data.frame(model1, variable = 'betas')
dplyr::glimpse(beta_post)
## Rows: 1,000
## Columns: 17
## $ `s(year).1`  <dbl> 2.08641, 1.99686, 2.06547, 1.78469, 1.95998, 2.07088, 1.9…
## $ `s(year).2`  <dbl> 2.68119, 2.75029, 2.70137, 2.64142, 2.66175, 2.63268, 2.6…
## $ `s(year).3`  <dbl> 3.07068, 2.99991, 3.11479, 3.06185, 3.10808, 3.11222, 3.1…
## $ `s(year).4`  <dbl> 3.36656, 3.19007, 3.19953, 3.31882, 3.21541, 3.19610, 3.2…
## $ `s(year).5`  <dbl> 2.05605, 2.18534, 2.27209, 2.20964, 2.14903, 2.16109, 2.0…
## $ `s(year).6`  <dbl> 1.70304, 1.74661, 1.79700, 1.69840, 1.90235, 1.79550, 1.7…
## $ `s(year).7`  <dbl> 1.86708, 2.17969, 2.12229, 2.24989, 2.30511, 2.05341, 2.0…
## $ `s(year).8`  <dbl> 3.00259, 2.96931, 2.95120, 2.87897, 3.06848, 2.94030, 2.9…
## $ `s(year).9`  <dbl> 3.29020, 3.29952, 3.37849, 3.14558, 3.34278, 3.22204, 3.2…
## $ `s(year).10` <dbl> 2.61232, 2.78013, 2.80304, 2.65307, 2.86269, 2.85246, 2.7…
## $ `s(year).11` <dbl> 3.13481, 3.12531, 3.09686, 3.08571, 3.08537, 3.09990, 3.0…
## $ `s(year).12` <dbl> 3.21593, 3.13284, 3.18113, 3.25961, 3.18265, 3.21031, 3.1…
## $ `s(year).13` <dbl> 2.26007, 2.23171, 2.24981, 2.15222, 2.18645, 2.27654, 2.2…
## $ `s(year).14` <dbl> 2.58985, 2.53793, 2.59823, 2.71410, 2.65523, 2.60405, 2.5…
## $ `s(year).15` <dbl> 2.17358, 2.16870, 2.24486, 2.11155, 2.13043, 2.19803, 2.0…
## $ `s(year).16` <dbl> 2.18799, 1.96943, 2.04194, 2.12654, 2.09652, 2.05796, 2.0…
## $ `s(year).17` <dbl> 1.3031100, 0.2413670, 1.5672900, 1.3526500, 1.4232400, 0.…

With any model fitted in mvgam, the underlying Stan code can be viewed using the code function:

code(model1)
Check here to see the model’s Stan code
## // Stan model code generated by package mvgam
## data {
##   int<lower=0> total_obs; // total number of observations
##   int<lower=0> n; // number of timepoints per series
##   int<lower=0> n_series; // number of series
##   int<lower=0> num_basis; // total number of basis coefficients
##   matrix[total_obs, num_basis] X; // mgcv GAM design matrix
##   int<lower=0> ytimes[n, n_series]; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
##   int<lower=0> n_nonmissing; // number of nonmissing observations
##   int<lower=0> flat_ys[n_nonmissing]; // flattened nonmissing observations
##   matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
##   int<lower=0> obs_ind[n_nonmissing]; // indices of nonmissing observations
## }
## parameters {
##   // raw basis coefficients
##   vector[num_basis] b_raw;
##   // random effect variances
##   vector<lower=0>[1] sigma_raw;
##   // random effect means
##   vector[1] mu_raw;
## }
## transformed parameters {
##   // basis coefficients
##   vector[num_basis] b;
##   b[1:17] = mu_raw[1] + b_raw[1:17] * sigma_raw[1];
## }
## model {
##   // prior for random effect population variances
##   sigma_raw ~ exponential(0.5);
##   // prior for random effect population means
##   mu_raw ~ std_normal();
##   // prior (non-centred) for s(year)...
##   b_raw[1:17] ~ std_normal();
##   {
##   // likelihood functions
##   flat_ys ~ poisson_log_glm(flat_xs,
##   0.0,b);
##   }
## }
## generated quantities {
##   vector[total_obs] eta;
##   matrix[n, n_series] mus;
##   array[n, n_series] int ypred;
##   // posterior predictions
##   eta = X * b;
##   for(s in 1:n_series){ 
##   mus[1:n, s] = eta[ytimes[1:n, s]];
##   ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
##   }
## }


Plotting effects and residuals

Now for interrogating the model. We can get some sense of the variation in yearly intercepts from the summary above, but it is easier to understand them using targeted plots. Plot posterior distributions of the temporal random effects using plot.mvgam with type = 're'. See ?plot.mvgam for more details about the types of plots that can be produced from fitted mvgam objects

plot(model1, type = 're')

How do these yearly intercepts translate into time-varying predictions? To understand this, we can plot posterior hindcasts from this model for the training period using plot.mvgam with type = 'forecast'

plot(model1, type = 'forecast')

If you wish to extract these hindcasts for other downstream analyses, the hindcast function can be used. This will return a list object of class mvgam_forecast. In the hindcasts slot, a matrix of posterior retrodictions will be returned for each series in the data (only one series in our example):

hc <- hindcast(model1)
str(hc)
## List of 13
##  $ call              :Class 'formula'  language count ~ s(year, bs = "re") - 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  $ trend_call        : NULL
##  $ family            : chr "poisson"
##  $ trend_model       : chr "None"
##  $ drift             : logi FALSE
##  $ use_lv            : logi FALSE
##  $ fit_engine        : chr "stan"
##  $ type              : chr "response"
##  $ series_names      : chr "PP"
##  $ train_observations:List of 1
##   ..$ PP: int [1:199] 0 1 2 NA 10 NA NA 16 18 12 ...
##  $ test_observations : NULL
##  $ hindcasts         :List of 1
##   ..$ PP: num [1:1000, 1:199] 6 9 6 3 6 7 5 9 8 9 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:199] "ypred[1,1]" "ypred[2,1]" "ypred[3,1]" "ypred[4,1]" ...
##  $ forecasts         : NULL
##  - attr(*, "class")= chr "mvgam_forecast"

You can also extract these hindcasts on the linear predictor scale, which in this case is the log scale (our Poisson GLM used a log link function). Sometimes this can be useful for asking more targeted questions about drivers of variation:

hc <- hindcast(model1, type = 'link')
range(hc$hindcasts$PP)
## [1] -1.41752  3.45414

Objects of class mvgam_forecast have an associated plot function as well:

plot(hc)

This plot can look a bit confusing as it seems like there is linear interpolation from the end of one year to the start of the next. But this is just due to the way the lines are automatically connected in base R plots

In any regression analysis, a key question is whether the residuals show any patterns that can be indicative of un-modelled sources of variation. For GLMs, we can use a modified residual called the Dunn-Smyth, or randomized quantile, residual. Inspect Dunn-Smyth residuals from the model using plot.mvgam with type = 'residuals'

plot(model1, type = 'residuals')

Exercises

  1. Inspect the residual plot and make some notes about any features that stand out to you. Do these diagnostics look reasonable to you? How might you rectify these issues by modifying the model?
  2. The model contains posterior draws of Dunn-Smyth residuals that can be used for further investigations. For each series in the data (only one series in this case), residuals are in a matrix of dimension n_draws x n_timepoints. Calculate posterior median residuals per time point and plot these as a line plot (hint, use ?apply and either ?median or ?quantile for guidance). Add a dashed horizontal line at 0 (hint, use ?abline for guidance). Do these residuals look worrisome to you?
Check here for template code if you’re having trouble plotting median residuals over time plot
# fill in the "?" with the correct variable / value
# extract residuals for our rodent species of interest
model1_resids <- model1$resids$PP

# check the dimensions of the residuals matrix
dim(model1_resids)

# calculate posterior median residuals per timepoint
median_resids <- apply(model1_resids,
                       MARGIN = ?, 
                       FUN = ?)

# plot median residuals over time
plot(median_resids,
     type = 'l',
     # not necessary but makes for a prettier plot
     lwd = 2,
     col = 'darkred',
     bty = 'l',
     ylab = 'Median residual',
     xlab = 'Time')

# add a horizontal dashed line at zero
abline(? = 0, lty = 'dashed')

Automatic forecasting for new data

These temporal random effects do not have a sense of “time”. Because of this, each yearly random intercept is not restricted in some way to be similar to the previous yearly intercept. This drawback becomes evident when we predict for a new year. To do this, we can repeat the exercise above but this time will split the data into training and testing sets before re-running the model. We can then supply the test set as newdata. For splitting, we will make use of the filter function from dplyr

model_data %>% 
  dplyr::filter(time <= 160) -> data_train 
model_data %>% 
  dplyr::filter(time > 160) -> data_test
model1b <- mvgam(count ~ s(year, bs = 're') - 1,
                family = poisson(),
                data = data_train,
                newdata = data_test)

Repeating the plots above gives insight into how the model’s hierarchical prior formulation provides all the structure needed to sample values for un-modelled years

plot(model1b, type = 're')

plot(model1b, type = 'forecast')

We can also view the test data in the forecast plot to see that the predictions do not capture the temporal variation in the test set

plot(model1b, type = 'forecast', newdata = data_test)

## Out of sample DRPS:
## [1] 181.994

As with the hindcast function, we can use the forecast function to automatically extract the posterior distributions for these predictions. This also returns an object of class mvgam_forecast, but now it will contain both the hindcasts and forecasts for each series in the data:

fc <- forecast(model1b)
str(fc)
## List of 13
##  $ call              :Class 'formula'  language count ~ s(year, bs = "re") - 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  $ trend_call        : NULL
##  $ family            : chr "poisson"
##  $ trend_model       : chr "None"
##  $ drift             : logi FALSE
##  $ use_lv            : logi FALSE
##  $ fit_engine        : chr "stan"
##  $ type              : chr "response"
##  $ series_names      : chr "PP"
##  $ train_observations:List of 1
##   ..$ PP: int [1:160] 0 1 2 NA 10 NA NA 16 18 12 ...
##  $ test_observations :List of 1
##   ..$ PP: int [1:39] NA 0 0 10 3 14 18 NA 28 46 ...
##  $ hindcasts         :List of 1
##   ..$ PP: num [1:2000, 1:160] 12 4 7 5 6 10 5 11 12 6 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:160] "ypred[1,1]" "ypred[2,1]" "ypred[3,1]" "ypred[4,1]" ...
##  $ forecasts         :List of 1
##   ..$ PP: num [1:2000, 1:39] 6 8 8 7 16 10 7 10 10 8 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:39] "ypred[161,1]" "ypred[162,1]" "ypred[163,1]" "ypred[164,1]" ...
##  - attr(*, "class")= chr "mvgam_forecast"

Adding predictors as “fixed” effects

Any users familiar with GLMs will know that we nearly always wish to include predictor variables that may explain some of the variation in our observations. This is also true when working with time series, though the inclusion of predictors immediately rules out some of the more popular forecasting algorithms (such as Exponential Smoothing). Predictors are easily incorporated into GLMs / GAMs. Here, we will update the model from above by including a parametric (fixed) effect of ndvi as a linear predictor:

model2 <- mvgam(count ~ s(year, bs = 're') + 
                  ndvi - 1,
                family = poisson(),
                data = data_train,
                newdata = data_test)

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = \beta_{year[year_t]} * \boldsymbol{year}_{t} + \beta_{ndvi} * \boldsymbol{ndvi}_t \\ \beta_{year} & \sim \text{Normal}(\mu_{year}, \sigma_{year}) \\ \beta_{ndvi} & \sim \text{Normal}(0, 1) \end{align*}\]

Where the \(\beta_{year}\) effects are the same as before but we now have another predictor \((\beta_{ndvi})\) that applies to the ndvi value at each timepoint \(t\). Inspect the summary of this model

summary(model2)
## GAM formula:
## count ~ s(year, bs = "re") + ndvi - 1
## 
## Family:
## poisson
## 
## Link function:
## log
## 
## Trend model:
## None
## 
## N series:
## 1 
## 
## N timepoints:
## 160 
## 
## Status:
## Fitted using Stan 
## 
## GAM coefficient (beta) estimates:
##                 2.5%       50%     97.5% Rhat n.eff
## ndvi       0.3255489 0.3824155 0.4395289    1  1782
## s(year).1  1.1498320 1.4245500 1.6717675    1  2265
## s(year).2  1.8225573 2.0132400 2.2007708    1  2293
## s(year).3  2.2176740 2.3968850 2.5633597    1  1997
## s(year).4  2.3476095 2.5174900 2.6813202    1  2037
## s(year).5  1.2173385 1.4370300 1.6506880    1  2158
## s(year).6  1.0371762 1.2814150 1.5285705    1  2750
## s(year).7  1.1456842 1.4278850 1.6655193    1  2503
## s(year).8  2.1095690 2.2862450 2.4504140    1  2057
## s(year).9  2.7279203 2.8594750 2.9909630    1  2098
## s(year).10 1.9939265 2.1953300 2.3817735    1  2310
## s(year).11 2.2980020 2.4476550 2.5978207    1  1972
## s(year).12 2.5546960 2.7015400 2.8423405    1  1958
## s(year).13 1.3984955 1.6277350 1.8518907    1  2574
## s(year).14 0.6581002 2.0028750 3.3365632    1  1710
## s(year).15 0.7458807 1.9691400 3.2737317    1  1402
## s(year).16 0.7341640 1.9947950 3.3130235    1  1619
## s(year).17 0.6248627 1.9674450 3.2472470    1  1675
## 
## GAM group-level estimates:
##                 2.5%       50%    97.5% Rhat n.eff
## mean(year) 1.6206900 1.9998050 2.358635 1.01   412
## sd(year)   0.4041816 0.6019445 1.041841 1.00   470
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior

Rather than printing the summary each time, we can also quickly look at the posterior empirical quantiles for the fixed effect of ndvi (and other linear predictor coefficients) using coef:

coef(model2)
##                 2.5%       50%     97.5% Rhat n.eff
## ndvi       0.3255489 0.3824155 0.4395289    1  1782
## s(year).1  1.1498320 1.4245500 1.6717675    1  2265
## s(year).2  1.8225573 2.0132400 2.2007708    1  2293
## s(year).3  2.2176740 2.3968850 2.5633597    1  1997
## s(year).4  2.3476095 2.5174900 2.6813202    1  2037
## s(year).5  1.2173385 1.4370300 1.6506880    1  2158
## s(year).6  1.0371762 1.2814150 1.5285705    1  2750
## s(year).7  1.1456842 1.4278850 1.6655193    1  2503
## s(year).8  2.1095690 2.2862450 2.4504140    1  2057
## s(year).9  2.7279203 2.8594750 2.9909630    1  2098
## s(year).10 1.9939265 2.1953300 2.3817735    1  2310
## s(year).11 2.2980020 2.4476550 2.5978207    1  1972
## s(year).12 2.5546960 2.7015400 2.8423405    1  1958
## s(year).13 1.3984955 1.6277350 1.8518907    1  2574
## s(year).14 0.6581002 2.0028750 3.3365632    1  1710
## s(year).15 0.7458807 1.9691400 3.2737317    1  1402
## s(year).16 0.7341640 1.9947950 3.3130235    1  1619
## s(year).17 0.6248627 1.9674450 3.2472470    1  1675

Look at the estimated effect of ndvi using plot.mvgam with type = 'pterms'

plot(model2, type = 'pterms')

This plot indicates a positive linear effect of ndvi on log(counts). But it may be easier to visualise using a histogram, especially for parametric (linear) effects. This can be done by first extracting the posterior coefficients as we did in the first example:

beta_post <- as.data.frame(model2, variable = 'betas')
dplyr::glimpse(beta_post)
## Rows: 2,000
## Columns: 18
## $ ndvi         <dbl> 0.407975, 0.448711, 0.312580, 0.367814, 0.393308, 0.37548…
## $ `s(year).1`  <dbl> 1.34033, 1.41454, 1.44933, 1.43563, 1.34351, 1.33825, 1.5…
## $ `s(year).2`  <dbl> 2.01664, 1.89794, 2.13148, 2.01000, 1.90806, 1.98100, 2.1…
## $ `s(year).3`  <dbl> 2.40344, 2.14570, 2.41497, 2.44216, 2.31749, 2.42674, 2.4…
## $ `s(year).4`  <dbl> 2.36290, 2.29836, 2.76185, 2.61542, 2.54164, 2.42037, 2.5…
## $ `s(year).5`  <dbl> 1.45927, 1.27385, 1.69627, 1.35593, 1.54548, 1.40788, 1.6…
## $ `s(year).6`  <dbl> 1.28766, 1.30079, 1.34980, 1.30679, 1.43021, 1.24454, 1.3…
## $ `s(year).7`  <dbl> 1.53067, 1.34581, 1.53425, 1.35311, 1.40810, 1.46172, 1.4…
## $ `s(year).8`  <dbl> 2.20394, 2.11864, 2.49457, 2.28829, 2.19045, 2.35497, 2.2…
## $ `s(year).9`  <dbl> 2.82041, 2.82295, 2.91648, 2.84289, 2.78895, 2.89095, 2.8…
## $ `s(year).10` <dbl> 2.06247, 2.15131, 2.29958, 2.20493, 2.14252, 2.19779, 2.2…
## $ `s(year).11` <dbl> 2.42252, 2.36058, 2.58358, 2.40929, 2.39677, 2.45329, 2.4…
## $ `s(year).12` <dbl> 2.67335, 2.60469, 2.81059, 2.67735, 2.70482, 2.69611, 2.7…
## $ `s(year).13` <dbl> 1.50715, 1.64267, 1.64778, 1.61537, 1.42617, 1.59902, 1.6…
## $ `s(year).14` <dbl> 2.16657, 2.36654, 1.03897, 1.74624, 2.87284, 1.49705, 1.3…
## $ `s(year).15` <dbl> 1.58433, 1.17143, 3.47352, 2.88605, 3.30565, 1.32484, 1.8…
## $ `s(year).16` <dbl> 2.99940, 2.45124, 1.67960, 3.10844, 2.14077, 2.74463, 1.9…
## $ `s(year).17` <dbl> 1.838660, 1.985060, 2.273010, 2.939800, 1.818110, 1.99345…

The posterior distribution for the effect of ndvi is stored in the ndvi column. A quick histogram confirms our inference that log(counts) respond positively to increases in ndvi:

hist(beta_post$ndvi,
     xlim = c(-1 * max(abs(beta_post$ndvi)),
              max(abs(beta_post$ndvi))),
     col = 'darkred',
     border = 'white',
     xlab = expression(beta[NDVI]),
     ylab = '',
     yaxt = 'n',
     main = '',
     lwd = 2)
abline(v = 0, lwd = 2.5)

Given our model used a nonlinear link function (log link in this example), it can still be difficult to fully understand what relationship our model is estimating between a predictor and the response. Fortunately, the marginaleffects package makes this relatively straightforward. Objects of class mvgam can be used with marginaleffects to inspect contrasts, scenario-based predictions, conditional and marginal effects, all on the outcome scale. Here we will use the plot_predictions function from marginaleffects to inspect the conditional effect of ndvi (use ?plot_predictions for guidance on how to modify these plots):

plot_predictions(model2, 
                 condition = "ndvi",
                 # include the observed count values
                 # as points, and show rugs for the observed
                 # ndvi and count values on the axes
                 points = 0.5, rug = TRUE) +
  theme_classic()
## Warning: Removed 28 rows containing missing values (geom_point).

Now it is easier to get a sense of the nonlinear but positive relationship estimated between ndvi and count. Plotting on the link scale should give an almost identical plot to the pterms plot from mvgam above, which shows the linear effect on the link scale:

plot_predictions(model2, 
                 condition = "ndvi",
                 type = 'link') +
  theme_classic()

But has this predictor made an impact on our estimated hindcasts and forecasts? Visualise these plots as we did previously

plot(model2, type = 'forecast', newdata = data_test)

## Out of sample DRPS:
## [1] 150.6624

These predictions certainly look different. Are the residual diagnostics improved?

plot(model2, type = 'residuals')

Exercises

  1. Make a scatterplot comparing ndvi and our outcome count
  2. Add mintemp as a second predictor. Be sure to first check whether there is a strong pairwise correlation between mintemp and ndvi (hint, use ?cor.test for guidance)
  3. Does the model estimate any strong effect of mintemp on log(counts)? Or does inclusion of mintemp lead to different inference on the effect of ndvi compared to when mintemp was not in the model?

Random nonlinear functions of time

The final point to cover in this tutorial is to incorporate nonlinear functions of time in place of the temporal yearly random effects. Nonlinear splines are commonly viewed as variations of random effects in which the coefficients that control the shape of the spline are drawn from a joint, penalized distribution. This strategy is very often used in ecological time series analysis to capture smooth temporal variation in the processes we seek to study. The lectures (particularly lecture 2) have gone into detail about smoothing splines and basis functions, but it is worth quickly re-iterating some of this detail here. When we construct smoothing splines, the workhorse package mgcv will calculate a set of basis functions that will collectively control the shape and complexity of the resulting spline. It is often helpful to visualize these basis functions to get a better sense of how splines work. We’ll create a set of 6 basis functions to represent possible variation in the effect of time on our outcome. This can easily be done using functions in the gratia package:

basis_funs <- basis(s(time, bs = 'tp', k = 6), 
                            data = model_data)

Once the functions are created, the next step is to plot them (using ggplot2 utilities)

draw(basis_funs) +
    theme_classic() +
  labs(y = 'Basis function value', x = 'Time')

Upping the potential complexity of the smooth function, by increasing k (to 10 in this example), leads to a larger diversity of basis functions:

draw(basis(s(time, bs = 'tp', k = 10), 
                            data = model_data)) +
    theme_classic() +
  labs(y = 'Basis function value', x = 'Time')

Check here for code to produce and plot basis functions using base R routines
# Set up the smooth object that is needed to predict values of
# basis functions
temp_gam <- gam(count ~ s(time, bs = 'tp', k = 6),
                data = model_data,
                family = poisson(), fit = FALSE)

# We can understand how this penalty matrix is used to 
# translate values of `time` into basis functions by 
# generating a sequence of `time` values for sampling 
# prior function realisations
cov_seq <- seq(min(model_data$time),
               max(model_data$time),
               length.out = 500)
summary(cov_seq)

# Generate the linear predictor matrix for this 
# sequence of `time` values
Xp <- mgcv::PredictMat(temp_gam$smooth[[1]],
                       data.frame(time = cov_seq))
str(Xp)

# This linear predictor contains 5 columns, one for each 
# function; the sequences of values in the columns represent 
# the actual basis functions over the range of simulated 
# `time` values
par(mar = c(4,4,2,1.5))
plot(1, type = 'n',
     ylim = range(Xp),
     xlim = range(cov_seq),
     bty = 'l',
     ylab = 'Basis function value',
     xlab = 'Time')
cols <- viridis::viridis(n = NCOL(Xp), alpha = 0.8)
for(i in 1:NCOL(Xp)){
  lines(x = cov_seq, y = Xp[,i], lwd = 3, col = 'white')
  lines(x = cov_seq, y = Xp[,i], lwd = 2.5, col = cols[i])
}
box(bty = 'l', lwd = 2)

Smooths in mvgam

In addition to constructing the basis functions, mgcv also creates a penalty matrix \(S\), which contains known coefficients that work to constrain the wiggliness of the resulting smooth function. When fitting a GAM to data, we must estimate the smoothing parameters (\(\lambda\)) that will penalize these matrices, resulting in constrained basis coefficients and smoother functions that are less likely to overfit the data. This is the key to fitting GAMs in a Bayesian framework, as we can jointly estimate the \(\lambda\)’s using informative priors to prevent overfitting and expand the complexity of models we can tackle. To see this in practice, we can now fit a model that replaces the yearly random effects with a smooth function of time. We will need a reasonably complex function (large k) to try and accommodate the temporal variation in our observations. Following some useful advice by Gavin Simpson, we will use a b-spline basis for the temporal smooth. Because we no longer have intercepts for each year, we also retain the primary intercept term in this model (there is no -1 in the formula now):

model3 <- mvgam(count ~ s(time, bs = 'bs', k = 15) + 
                  ndvi + 
                  mintemp,
                family = poisson(),
                data = data_train,
                newdata = data_test)

The model can be described mathematically as follows: \[\begin{align*} \boldsymbol{count}_t & \sim \text{Poisson}(\lambda_t) \\ log(\lambda_t) & = f(\boldsymbol{time})_t + \beta_{ndvi} * \boldsymbol{ndvi}_t + \beta_{mintemp} * \boldsymbol{mintemp}_t \\ f(\boldsymbol{time}) & = \sum_{k=1}^{K}b * \beta_{smooth} \\ \beta_{smooth} & \sim \text{MVNormal}(0, (\Omega * \lambda)^{-1}) \\ \beta_{ndvi} & \sim \text{Normal}(0, 1) \\ \beta_{mintemp} & \sim \text{Normal}(0, 1) \end{align*}\]

Where the smooth function \(f_{time}\) is built by summing across a set of weighted basis functions. The basis functions \((b)\) are constructed using a thin plate regression basis in mgcv. The weights \((\beta_{smooth})\) are drawn from a penalized multivariate normal distribution where the precision matrix \((\Omega\)) is multiplied by a smoothing penalty \((\lambda)\). If \(\lambda\) becomes large, this acts to squeeze the covariances among the weights \((\beta_{smooth})\), leading to a less wiggly spline. Note that sometimes there are multiple smoothing penalties that contribute to the covariance matrix, but I am only showing one here for simplicity. View the summary as before

summary(model3)
## GAM formula:
## count ~ s(time, bs = "bs", k = 15) + ndvi + mintemp
## 
## Family:
## poisson
## 
## Link function:
## log
## 
## Trend model:
## None
## 
## N series:
## 1 
## 
## N timepoints:
## 160 
## 
## Status:
## Fitted using Stan 
## 
## GAM coefficient (beta) estimates:
##                    2.5%          50%       97.5% Rhat n.eff
## (Intercept)  2.15320475  2.256000000  2.36749650 1.00  1031
## ndvi        -0.07673098 -0.009731935  0.05071218 1.00  1029
## mintemp      0.05948835  0.064020050  0.06877945 1.00  2196
## s(time).1   -1.77148775 -0.764100000  0.20580517 1.01   647
## s(time).2    0.02171737  0.901522500  1.84519475 1.01   452
## s(time).3   -0.41182568  0.574858500  1.49643200 1.01   415
## s(time).4    1.74270075  2.646480000  3.61237875 1.01   389
## s(time).5   -1.27019600 -0.266582000  0.63350927 1.01   471
## s(time).6   -0.74920732  0.180378000  1.16820250 1.01   410
## s(time).7   -1.41744850 -0.420694500  0.54162530 1.01   467
## s(time).8    0.39388228  1.312325000  2.25009000 1.01   409
## s(time).9    0.75215325  1.673785000  2.56390650 1.01   386
## s(time).10  -0.45106705  0.516454500  1.44336025 1.01   401
## s(time).11   0.55381472  1.500000000  2.38252025 1.01   394
## s(time).12   0.41485400  1.245255000  2.10001425 1.01   400
## s(time).13  -1.69336700 -0.823947500  0.02239245 1.01   517
## s(time).14  -5.94188325 -3.024810000 -0.02992696 1.01   425
## 
## GAM observation smoothing parameter (rho) estimates:
##                2.5%       50%    97.5% Rhat n.eff
## s(time)  0.05031779 0.9813405 1.764595 1.00  1291
## s(time)2 0.40759215 3.0890350 4.162769 1.01   643
## 
## Stan MCMC diagnostics:
## n_eff / iter looks reasonable for all parameters
## Rhat looks reasonable for all parameters
## 0 of 2000 iterations ended with a divergence (0%)
## 0 of 2000 iterations saturated the maximum tree depth of 12 (0%)
## E-FMI indicated no pathological behavior

The summary above now contains posterior estimates for the smoothing parameters as well as the basis coefficients for the nonlinear effect of time. We can visualize the conditional time effect using the plot function with type = 'smooths':

plot(model3, type = 'smooths')

By default this plots shows posterior empirical quantiles, but it can also be helpful to view some realizations of the underlying function (here, each line is a different potential curve drawn from the posterior of all possible curves):

plot(model3, type = 'smooths', realisations = TRUE,
     n_realisations = 30)

Derivatives of smooths

A useful question when modelling using GAMs is to identify where the function is changing most rapidly. To address this, we can plot estimated 1st derivatives of the spline:

plot(model3, type = 'smooths', derivatives = TRUE)

Here, values above >0 indicate the function was increasing at that time point, while values <0 indicate the function was declining. The most rapid declines appear to have been happening around timepoints 50 and again toward the end of the training period, for example.

Conditional smooths

We can use plot_predictions to view the conditional smooth of time on the scale of the outcome variable:

plot_predictions(model3, 
                 condition = "time",
                 points = 0.5, rug = TRUE) +
  theme_classic()
## Warning: Removed 28 rows containing missing values (geom_point).

Inspect the underlying Stan code to gain some idea of how the spline is being penalized:

code(model3)
## // Stan model code generated by package mvgam
## data {
##   int<lower=0> total_obs; // total number of observations
##   int<lower=0> n; // number of timepoints per series
##   int<lower=0> n_sp; // number of smoothing parameters
##   int<lower=0> n_series; // number of series
##   int<lower=0> num_basis; // total number of basis coefficients
##   vector[num_basis] zero; // prior locations for basis coefficients
##   real p_taus[3]; // prior precisions for parametric coefficients
##   real p_coefs[3]; // prior locations for parametric coefficients
##   matrix[total_obs, num_basis] X; // mgcv GAM design matrix
##   int<lower=0> ytimes[n, n_series]; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
##   matrix[14,28] S1; // mgcv smooth penalty matrix S1
##   int<lower=0> n_nonmissing; // number of nonmissing observations
##   int<lower=0> flat_ys[n_nonmissing]; // flattened nonmissing observations
##   matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
##   int<lower=0> obs_ind[n_nonmissing]; // indices of nonmissing observations
## }
## parameters {
##   // raw basis coefficients
##   vector[num_basis] b_raw;
##   // smoothing parameters
##   vector<lower=0>[n_sp] lambda;
## }
## transformed parameters {
##   // basis coefficients
##   vector[num_basis] b;
##   b[1:num_basis] = b_raw[1:num_basis];
## }
## model {
##   // parametric effect priors (regularised for identifiability)
##   for (i in 1:3) {
##   b_raw[i] ~ normal(p_coefs[i], sqrt(1 / p_taus[i]));
##   }
##   // prior for s(time)...
##   b_raw[4:17] ~ multi_normal_prec(zero[4:17],S1[1:14,1:14] * lambda[1] + S1[1:14,15:28] * lambda[2]);
##   // priors for smoothing parameters
##   lambda ~ normal(10, 25);
##   {
##   // likelihood functions
##   flat_ys ~ poisson_log_glm(flat_xs,
##   0.0,b);
##   }
## }
## generated quantities {
##   vector[total_obs] eta;
##   matrix[n, n_series] mus;
##   vector[n_sp] rho;
##   array[n, n_series] int ypred;
##   rho = log(lambda);
##   // posterior predictions
##   eta = X * b;
##   for(s in 1:n_series){ 
##   mus[1:n, s] = eta[ytimes[1:n, s]];
##   ypred[1:n, s] = poisson_log_rng(mus[1:n, s]);
##   }
## }

The line below // prior for s(time)... shows how the spline basis coefficients are drawn from a zero-centred multivariate normal distribution. The precision matrix \(S\) is penalized by two different smoothing parameters (the \(\lambda\)’s) to enforce smoothness and reduce overfitting. In the next tutorial, we will expand on temporal smooths and understand their weaknesses for generating out of sample forecasts

Exercises

  1. Change the ndvi effect to a smooth function of ndvi using the s() argument (hint, see ?s for guidance on how to define smooth terms in GAM formulae using mgcv syntax). Inspect the resulting function estimates for both ndvi and time
  2. Try increasing the complexity of the temporal smooth by changing k to a larger number, something like 50. Do you get any sense that there are problems with fitting this model? Do any of your conclusions change?
  3. Inspect model residuals to look for any remaining un-modelled temporal patterns. What other steps might we take to address these?
  4. Look over a nice blog post by Gavin Simpson on strategies for fitting nonlinear interactive effects with GAMs to model changing seasonal patterns in time series; in particular, this post begins addressing the issue of autocorrelated errors, which will be a prominent feature of our next tutorial

Session Info

sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Australia.utf8  LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] viridis_0.6.2          viridisLite_0.4.1      zoo_1.8-11            
##  [4] ggplot2_3.3.6          gratia_0.8.1           mvgam_1.0.5           
##  [7] insight_0.19.3         marginaleffects_0.13.0 mgcv_1.8-41           
## [10] nlme_3.1-157           dplyr_1.0.10           downloadthis_0.3.2    
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.0-3     ellipsis_0.3.2       ggridges_0.5.3      
##   [4] brms_2.19.0          markdown_1.1         base64enc_0.1-3     
##   [7] fs_1.6.1             rstudioapi_0.14      farver_2.1.1        
##  [10] rstan_2.26.13        smooth_3.1.6         DT_0.26             
##  [13] fansi_1.0.3          mvtnorm_1.1-3        lubridate_1.8.0     
##  [16] bridgesampling_1.1-2 codetools_0.2-18     splines_4.2.1       
##  [19] cachem_1.0.7         knitr_1.42           texreg_1.38.6       
##  [22] shinythemes_1.2.0    greybox_1.0.5        bayesplot_1.9.0     
##  [25] jsonlite_1.8.4       nloptr_2.0.3         bsplus_0.1.4        
##  [28] shiny_1.7.2          httr_1.4.4           compiler_4.2.1      
##  [31] backports_1.4.1      assertthat_0.2.1     Matrix_1.5-1        
##  [34] fastmap_1.1.0        cli_3.4.0            later_1.3.0         
##  [37] htmltools_0.5.5      prettyunits_1.1.1    tools_4.2.1         
##  [40] igraph_1.3.4         coda_0.19-4          gtable_0.3.1        
##  [43] glue_1.6.2           reshape2_1.4.4       posterior_1.3.1     
##  [46] V8_4.2.1             Rcpp_1.0.10          jquerylib_0.1.4     
##  [49] vctrs_0.6.2          crosstalk_1.2.0      tensorA_0.36.2      
##  [52] xfun_0.38            stringr_1.5.0        ps_1.7.1            
##  [55] collapse_1.9.6       mime_0.12            miniUI_0.1.1.1      
##  [58] lifecycle_1.0.3      gtools_3.9.3         statmod_1.4.37      
##  [61] klippy_0.0.0.9500    MASS_7.3-57          scales_1.2.1        
##  [64] colourpicker_1.2.0   promises_1.2.0.1     Brobdingnag_1.2-9   
##  [67] parallel_4.2.1       inline_0.3.19        shinystan_2.6.0     
##  [70] yaml_2.3.6           mvnfast_0.2.8        curl_4.3.2          
##  [73] pbapply_1.5-0        gridExtra_2.3        loo_2.5.1           
##  [76] StanHeaders_2.26.13  sass_0.4.5           stringi_1.7.12      
##  [79] highr_0.10           dygraphs_1.1.1.6     checkmate_2.1.0     
##  [82] pkgbuild_1.3.1       zip_2.2.2            rlang_1.1.0         
##  [85] pkgconfig_2.0.3      matrixStats_0.62.0   distributional_0.3.1
##  [88] pracma_2.3.8         evaluate_0.20        lattice_0.20-45     
##  [91] purrr_0.3.4          labeling_0.4.2       rstantools_2.2.0    
##  [94] patchwork_1.1.2      htmlwidgets_1.5.4    tidyselect_1.2.0    
##  [97] processx_3.7.0       plyr_1.8.7           magrittr_2.0.3      
## [100] R6_2.5.1             generics_0.1.3       DBI_1.1.3           
## [103] withr_2.5.0          pillar_1.8.1         xts_0.12.2          
## [106] abind_1.4-5          tibble_3.1.8         crayon_1.5.2        
## [109] utf8_1.2.2           rmarkdown_2.21       grid_4.2.1          
## [112] data.table_1.14.4    callr_3.7.3          threejs_0.3.3       
## [115] digest_0.6.29        xtable_1.8-4         tidyr_1.2.1         
## [118] httpuv_1.6.9         RcppParallel_5.1.5   stats4_4.2.1        
## [121] munsell_0.5.0        bslib_0.4.2          shinyjs_2.1.0